IAGLR 2021 H2O Workshop
1 Agenda
- Welcomes and introductions – Timothy Maguire (UMich CIGLR) and Jo-fai Chow (H2O.ai) (5 mins)
- Great Lakes research questions driven by data – Timothy Maguire (15 mins)
- Introduction to the H2O.ai software – Jo-fai Chow (5 mins)
- Live-code example of data manipulation and machine learning analysis – Jo-fai Chow (20 mins)
- Results in context – Timothy Maguire (10 mins)
- Q&A (5 mins)
2 Introduction to H2O
TBA
3 Software and Code
3.1 Code
- setup.R: install required packages
- tutorial.Rmd: the main RMarkdown file with code
- tutorial.html: this webpage
- GitHub Repo: https://github.com/woobe/IAGLR_2021_H2O_Workshop
3.2 R Packages
- Check out setup.R
- For this tutorial:
  - h2o for automatic and explainable machine learning
- For RMarkdown:
  - knitr for rendering this RMarkdown
  - rmdformats for the readthedown RMarkdown template
  - DT for nice tables
4 H2O Basics
# Let's go
library(h2o) # for H2O Machine Learning
4.1 Start a local H2O Cluster (JVM)
h2o.init()
Connection successful!
R is connected to the H2O cluster:
H2O cluster uptime: 6 minutes 51 seconds
H2O cluster timezone: Europe/London
H2O data parsing timezone: UTC
H2O cluster version: 3.32.1.2
H2O cluster version age: 14 days, 22 hours and 13 minutes
H2O cluster name: H2O_started_from_R_joe_ljh865
H2O cluster total nodes: 1
H2O cluster total memory: 3.33 GB
H2O cluster total cores: 4
H2O cluster allowed cores: 4
H2O cluster healthy: TRUE
H2O Connection ip: localhost
H2O Connection port: 54321
H2O Connection proxy: NA
H2O Internal Security: FALSE
H2O API Extensions: Amazon S3, XGBoost, Algos, AutoML, Core V3, TargetEncoder, Core V4
R Version: R version 4.0.5 (2021-03-31)
h2o.no_progress() # disable progress bar for RMarkdown
h2o.removeAll() # Optional: remove anything from previous session
# Enter your lucky seed here ...
n_seed <- 123455
5 Data - Dom Lake Huron Abund
lake_data <- h2o.importFile('https://raw.githubusercontent.com/woobe/IAGLR_2021_H2O_Workshop/main/data/Dom_Lake_Huron_Abund.csv')
datatable(as.data.frame(head(lake_data)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE, round)) %>%
  formatRound(columns = -1, digits = 4)
h2o.describe(lake_data)
  Label Type Missing Zeros PosInf NegInf     Min       Max      Mean
1 C1 int 0 0 0 0 1.0000 885.00000 443.00000
2 Year int 0 0 0 0 2006.0000 2012.00000 2009.00113
3 Region enum 0 96 0 0 0.0000 3.00000 NA
4 Station enum 0 24 0 0 0.0000 122.00000 NA
5 Replicate int 0 1 0 0 0.0000 3.00000 1.99887
6 lat real 0 0 0 0 43.2695 46.23333 44.68487
Sigma Cardinality
1 255.6217909 NA
2 2.3727809 NA
3 NA 4
4 NA 123
5 0.8190319 NA
6 0.8554837 NA
[ reached 'max' / getOption("max.print") -- omitted 62 rows ]
h2o.hist(lake_data$DPOL, breaks = 100)
h2o.hist(lake_data$DBUG, breaks = 100)
5.1 Define Target and Features
target_DPOL <- "DPOL"
target_DBUG <- "DBUG"
# Remove targets, C1, and Dreisseniidae (which is DPOL + DBUG)
features <- setdiff(colnames(lake_data), c(target_DPOL, target_DBUG, "C1", "Dreisseniidae"))
print(features)
 [1] "Year"                  "Region"                "Station"
[4] "Replicate" "lat" "lon"
[7] "depth" "substrate" "Season"
[10] "Na2O" "Magnesium" "MagnesiumOxide"
[13] "AluminumOxide" "SiliconDioxide" "Quartz"
[16] "PhosphorusPentoxide" "Sulphur" "DDT_TOTAL"
[19] "P.P.TDE" "P.P.DDE" "P.P.DDE.1"
[22] "HeptachlorEpoxide" "Potassium" "PotassiumOxide"
[25] "Calcium" "CalciumOxide" "TitaniumDioxide"
[28] "Chromium" "Manganese" "ManganeseOxide"
[31] "XTR.Iron" "Iron" "IronOxide"
[34] "Cobalt" "Nickel" "Copper"
[37] "Zinc" "Selenium" "Strontium"
[40] "Beryllium" "Silver" "Cadmium"
[43] "Tin" "TotalCarbon" "OrganicCarbon"
[46] "Carbon.Nitrogen_Ratio" "TotalNitrogen" "Mercury"
[49] "Lead" "Uranium" "Vanadium"
[52] "Arsenic" "Chloride" "Fluoride"
[55] "Sand" "Silt" "Clay"
[58] "Mean_grainsize" "simpsonD" "shannon"
[61] "Chironomidae" "Oligochaeta" "Sphaeriidae"
[64] "Diporeia"
5.2 Split Data into Train/Test
h_split <- h2o.splitFrame(lake_data, ratios = 0.75, seed = n_seed)
h_train <- h_split[[1]] # 75% for modelling
h_test <- h_split[[2]] # 25% for evaluation
dim(h_train)
[1] 656  68
dim(h_test)
[1] 229  68
5.3 Cross-Validation
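The models in the next section all use nfolds = 5. As a quick illustration (a base-R sketch of the idea, not part of the workshop code; n_rows is a stand-in for the training set size), 5-fold cross-validation assigns every training row to one of five folds, and each fold takes a turn as the holdout set while the other four are used for training:

```r
# Minimal base-R sketch of what nfolds = 5 does behind the scenes
set.seed(123455)                       # same lucky seed as n_seed above
n_rows <- 656                          # roughly the size of h_train
fold_id <- sample(rep(1:5, length.out = n_rows))  # random fold assignment

# Rows held out in fold 1 vs. rows used for training in that round
holdout_1 <- which(fold_id == 1)
train_1   <- which(fold_id != 1)

length(holdout_1) + length(train_1) == n_rows  # TRUE: every row used exactly once
```

H2O handles this fold assignment internally when you pass nfolds = 5 (or you can supply your own assignment via the fold_column argument).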
6 Worked Example - Target “DPOL”
6.1 Baseline Models
- h2o.glm(): H2O Generalized Linear Model
- h2o.randomForest(): H2O Random Forest Model
- h2o.gbm(): H2O Gradient Boosting Model
- h2o.deeplearning(): H2O Deep Neural Network Model
- h2o.xgboost(): H2O wrapper for eXtreme Gradient Boosting Model (XGBoost) from DMLC
Let’s start with GBM
model_gbm_DPOL <- h2o.gbm(x = features, # all 64 features defined above
y = target_DPOL, # target: "DPOL"
training_frame = h_train, # H2O dataframe with training data
nfolds = 5, # Using 5-fold CV
seed = n_seed) # Your lucky seed
# Cross-Validation
model_gbm_DPOL@model$cross_validation_metrics
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 4620.394
RMSE: 67.97348
MAE: 16.30104
RMSLE: NaN
Mean Residual Deviance : 4620.394
# Evaluate performance on test
h2o.performance(model_gbm_DPOL, newdata = h_test)
H2ORegressionMetrics: gbm
MSE: 2613.399
RMSE: 51.12142
MAE: 12.70523
RMSLE: NaN
Mean Residual Deviance : 2613.399
Let’s use RMSE as our evaluation metric.
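For reference, RMSE is the square root of the mean squared difference between observed and predicted values. A minimal base-R sketch (toy vectors, not model output):

```r
# RMSE: root mean squared error
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

actual    <- c(10, 0, 550, 5)
predicted <- c(12, 1, 540, 4)
rmse(actual, predicted)  # sqrt((4 + 1 + 100 + 1) / 4) = sqrt(26.5) ~ 5.15
```

Because errors are squared before averaging, RMSE penalises large misses more heavily than MAE, which matters here given the long-tailed abundance values.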
Build Other Baseline Models (GLM, DRF, GBM, DNN) - TRY IT YOURSELF!
# Try other H2O models
# model_glm <- h2o.glm(x = features, y = target, ...)
# model_gbm <- h2o.gbm(x = features, y = target, ...)
# model_drf <- h2o.randomForest(x = features, y = target, ...)
# model_dnn <- h2o.deeplearning(x = features, y = target, ...)
# model_xgb <- h2o.xgboost(x = features, y = target, ...)
6.2 Manual Tuning
6.2.1 Check out the hyper-parameters for each algo
?h2o.glm
?h2o.randomForest
?h2o.gbm
?h2o.deeplearning
?h2o.xgboost
6.2.2 Train a GBM model with manual settings
model_gbm_DPOL_m <- h2o.gbm(x = features,
y = target_DPOL,
training_frame = h_train,
nfolds = 5,
seed = n_seed,
# Manual Settings based on experience
learn_rate = 0.1, # use a lower rate (more conservative)
ntrees = 120, # use more trees (due to lower learn_rate)
sample_rate = 0.7, # use random n% of samples for each tree
col_sample_rate = 0.7) # use random n% of features for each tree
6.2.3 Comparison (RMSE: Lower = Better)
# Create a table to compare RMSE from different models
d_eval <- data.frame(model = c("GBM: Gradient Boosting Model (Baseline)",
"GBM: Gradient Boosting Model (Manual Settings)"),
stringsAsFactors = FALSE)
d_eval$RMSE_cv <- NA
d_eval$RMSE_test <- NA
# Store RMSE values
d_eval[1, ]$RMSE_cv <- model_gbm_DPOL@model$cross_validation_metrics@metrics$RMSE
d_eval[2, ]$RMSE_cv <- model_gbm_DPOL_m@model$cross_validation_metrics@metrics$RMSE
d_eval[1, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL, newdata = h_test))
d_eval[2, ]$RMSE_test <- h2o.rmse(h2o.performance(model_gbm_DPOL_m, newdata = h_test))
# Show Comparison (RMSE: Lower = Better)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE, round)) %>%
formatRound(columns = -1, digits = 4)
6.3 H2O AutoML
# Run AutoML (try n different models)
# Check out all options using ?h2o.automl
automl_DPOL <- h2o.automl(x = features,
y = target_DPOL,
training_frame = h_train,
nfolds = 5, # 5-fold Cross-Validation
max_models = 30, # Max number of models
stopping_metric = "RMSE", # Metric to optimize
exclude_algos = c("DeepLearning", "XGBoost"), # exclude some algos for a quick demo
seed = n_seed)
6.3.1 Leaderboard
datatable(as.data.frame(automl_DPOL@leaderboard),
rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE, round)) %>%
formatRound(columns = -1, digits = 4)
6.3.2 Best Model (Leader)
automl_DPOL@leader
Model Details:
==============
H2ORegressionModel: gbm
Model ID: GBM_grid__1_AutoML_20210514_152911_model_7
Model Summary:
number_of_trees number_of_internal_trees model_size_in_bytes min_depth
1 43 43 10605 4
max_depth mean_depth min_leaves max_leaves mean_leaves
1 4 4.00000 9 16 13.79070
H2ORegressionMetrics: gbm
** Reported on training data. **
MSE: 742.9795
RMSE: 27.25765
MAE: 7.378285
RMSLE: NaN
Mean Residual Deviance : 742.9795
H2ORegressionMetrics: gbm
** Reported on cross-validation data. **
** 5-fold cross-validation on training data (Metrics computed for combined holdout predictions) **
MSE: 4022.796
RMSE: 63.42551
MAE: 15.04154
RMSLE: NaN
Mean Residual Deviance : 4022.796
Cross-Validation Metrics Summary:
mean sd cv_1_valid cv_2_valid cv_3_valid
mae 15.045994 3.9332395 12.12232 11.87795 17.959179
mean_residual_deviance 4025.1875 2845.8782 2456.1912 2354.3364 8599.125
mse 4025.1875 2845.8782 2456.1912 2354.3364 8599.125
r2 0.5540781 0.14167649 0.5431397 0.5116641 0.44749096
residual_deviance 4025.1875 2845.8782 2456.1912 2354.3364 8599.125
rmse 60.5994 21.002983 49.559975 48.521503 92.73147
rmsle NaN 0.0 NaN NaN NaN
cv_4_valid cv_5_valid
mae 20.492651 12.777869
mean_residual_deviance 4990.901 1725.3843
mse 4990.901 1725.3843
r2 0.46939796 0.7986977
residual_deviance 4990.901 1725.3843
rmse 70.64631 41.537746
rmsle NaN NaN
6.3.3 Comparison (RMSE: Lower = Better)
d_eval_tmp <- data.frame(model = "Best Model from H2O AutoML",
RMSE_cv = automl_DPOL@leader@model$cross_validation_metrics@metrics$RMSE,
RMSE_test = h2o.rmse(h2o.performance(automl_DPOL@leader, newdata = h_test)))
d_eval <- rbind(d_eval, d_eval_tmp)
datatable(d_eval, rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE, round)) %>%
formatRound(columns = -1, digits = 4)
6.4 Make Predictions
yhat_test <- h2o.predict(automl_DPOL@leader, newdata = h_test)
head(yhat_test, 5)
     predict
1 35.388011
2 2.586987
3 577.323965
4 3.505565
5 4.594125
6.5 Explainable AI - Target “DPOL”
6.5.1 Global Explanations
# Explain Best Model from AutoML
h2o.explain(automl_DPOL@leader, newdata = h_test)
Residual Analysis
=================
> Residual Analysis plots the fitted values vs residuals on a test dataset. Ideally, residuals should be randomly distributed. Patterns in this plot can indicate potential problems with the model selection, e.g., using simpler model than necessary, not accounting for heteroscedasticity, autocorrelation, etc. Note that if you see "striped" lines of residuals, that is an artifact of having an integer valued (vs a real valued) response variable.
Variable Importance
===================
> The variable importance plot shows the relative importance of the most important variables in the model.
SHAP Summary
============
> SHAP summary plot shows the contribution of the features for each instance (row of data). The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function.
Partial Dependence Plots
========================
> Partial dependence plot (PDP) gives a graphical depiction of the marginal effect of a variable on the response. The effect of a variable is measured in change in the mean response. PDP assumes independence between the feature for which is the PDP computed and the rest.
Individual Conditional Expectations
===================================
> An Individual Conditional Expectation (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response. ICE plots are similar to partial dependence plots (PDP); PDP shows the average effect of a feature while ICE plot shows the effect for a single instance. This function will plot the effect for each decile. In contrast to the PDP, ICE plots can provide more insight, especially when there is stronger feature interaction.
6.5.2 Local Explanations
h2o.explain_row(automl_DPOL@leader, newdata = h_test, row_index = 1)
SHAP explanation
================
> SHAP explanation shows contribution of features for a given instance. The sum of the feature contributions and the bias term is equal to the raw prediction of the model, i.e., prediction before applying inverse link function. H2O implements TreeSHAP which when the features are correlated, can increase contribution of a feature that had no influence on the prediction.
Individual Conditional Expectations
===================================
> Individual conditional expectations (ICE) plot gives a graphical depiction of the marginal effect of a variable on the response for a given row. ICE plot is similar to partial dependence plot (PDP), PDP shows the average effect of a feature while ICE plot shows the effect for a single instance.
6.5.3 Local Contributions (for Tree-based Models)
predictions <- h2o.predict(automl_DPOL@leader, newdata = h_test)
as.data.frame(head(predictions, 5))
     predict
1 35.388011
2 2.586987
3 577.323965
4 3.505565
5 4.594125
contributions <- h2o.predict_contributions(automl_DPOL@leader, newdata = h_test)
datatable(as.data.frame(head(contributions)), rownames = FALSE, options = list(pageLength = 10, scrollX = TRUE, round)) %>%
formatRound(columns = -1, digits = 4)
7 Your Turn - Target “DBUG”
Get your hands dirty!
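The DPOL workflow above carries over directly. As a starting point, here is a commented template in the style of the earlier TRY IT YOURSELF block (uncomment and adapt as you go; it reuses target_DBUG, features, h_train, h_test, and n_seed defined above):

```r
# Repeat the AutoML workflow with "DBUG" as the target
# automl_DBUG <- h2o.automl(x = features,
#                           y = target_DBUG,
#                           training_frame = h_train,
#                           nfolds = 5,
#                           max_models = 30,
#                           stopping_metric = "RMSE",
#                           exclude_algos = c("DeepLearning", "XGBoost"),
#                           seed = n_seed)
# h2o.performance(automl_DBUG@leader, newdata = h_test)   # evaluate on test
# h2o.explain(automl_DBUG@leader, newdata = h_test)       # global explanations
```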